8 - Apache Hive Advanced Function Usage and Performance Tuning

Hadoop offline day 08 - Apache Hive advanced function usage and performance tuning


Today's learning objectives

Master the explode function and lateral views
Master row/column transposition and JSON data processing
Master window functions
Understand Hive data compression and file storage formats
Master common Hive tuning techniques (the important ones are marked in the outline below)

Today's course outline

#Advanced Hive functions (high-frequency topic in interviews, written tests, and development)
	explode (UDTF) function
	lateral view
	row/column transposition
	parsing JSON-formatted data
	window functions (window/OVER functions)
		per-group Top-N, cascading accumulation, consecutive-login problems
#Hive performance tuning
	Hive data file formats and data compression
		row-oriented vs column-oriented storage (ORC, parquet)
	general Hive tuning
		*join optimization
		*group by data-skew optimization
		*task parallelism
		other general tuning

Knowledge point 01: Apache Hive--explode function usage and limitations (UDTF table-generating function)

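explode is Hive's most common UDTF: it takes one array (or map) value and returns one row per element, which is also why a plain SELECT cannot mix it with ordinary columns. A minimal pure-Python sketch of that one-value-in, many-rows-out behavior (illustrative names, not a Hive API):

```python
# Sketch of what Hive's explode() UDTF does: one input value, many output rows.
# Pure-Python illustration; the function name mirrors Hive but is not Hive code.

def explode(array_value):
    """Yield one single-column row per element, like explode(array) in Hive."""
    for element in array_value:
        yield (element,)

# explode(array("a","b","c")) in Hive returns three rows.
rows = list(explode(["a", "b", "c"]))
print(rows)  # [('a',), ('b',), ('c',)]
```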

Knowledge point 02: Apache Hive--using the lateral view

A lateral view works by materializing the UDTF output as a view-like table, then joining every row of the original table with every row the UDTF produced from it, yielding a new virtual table.

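The per-row join described above can be sketched in a few lines of Python (a hypothetical helper, not Hive code): each source row is repeated once per element the UDTF extracts from it, with `col` standing in for explode's default output column name:

```python
# Sketch of LATERAL VIEW semantics: each source row joins with every row
# the UDTF produces from one of its columns. Illustrative only.

def lateral_view_explode(rows, array_col):
    """For each input row (a dict), emit one output row per element of row[array_col]."""
    out = []
    for row in rows:
        for element in row[array_col]:
            joined = dict(row)          # keep every original column...
            joined["col"] = element     # ...plus the exploded element
            out.append(joined)
    return out

movies = [{"name": "A", "genres": ["war", "drama"]},
          {"name": "B", "genres": ["comedy"]}]
result = lateral_view_explode(movies, "genres")
# 3 rows: A/war, A/drama, B/comedy
```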

Knowledge point 03: Apache Hive--row/column transposition--multiple rows to a single column (collect_list, concat_ws)

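As a runnable stand-in for collect_list + concat_ws, SQLite's group_concat produces the same multi-row-to-one-string collapse. A sqlite3 sketch with made-up sample data (table and column names are illustrative):

```python
# Multi-row -> single column: Hive does this with concat_ws(',', collect_list(...)).
# SQLite's group_concat(col, sep) plays the same role here.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table scores(name text, subject text)")
conn.executemany("insert into scores values (?, ?)",
                 [("zhangsan", "math"), ("zhangsan", "english"), ("lisi", "math")])

# One row per name, subjects collapsed into a comma-separated string.
rows = conn.execute(
    "select name, group_concat(subject, ',') from scores group by name order by name"
).fetchall()
print(rows)
```

Note that group_concat does not guarantee element order; in Hive, collect_list preserves encounter order within each group.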

Knowledge point 04: Apache Hive--row/column transposition--single column to multiple rows (explode, lateral view)


Knowledge point 05: Apache Hive--processing JSON-formatted data

Which method is better depends on the concrete business scenario.

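Hive typically extracts fields with get_json_object(json, '$.field') one field at a time, or with json_tuple(json, 'f1', 'f2', ...) for several fields at once. A Python sketch of the same extraction on a made-up sample record:

```python
# Sketch of Hive's two JSON extraction styles using Python's json module.
# The sample record is invented for illustration.
import json

line = '{"device":"device_001","deviceType":"kafka","signal":93.0}'
record = json.loads(line)

# get_json_object(line, '$.device') -> one field at a time
device = record["device"]

# json_tuple(line, 'device', 'deviceType', 'signal') -> several fields at once
device, dev_type, signal = (record[k] for k in ("device", "deviceType", "signal"))
print(device, dev_type, signal)  # device_001 kafka 93.0
```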

Knowledge point 06: Apache Hive--window functions--quick introduction and syntax

1. Quick introduction to what window functions do

(image: 8-Apache Hive函数高阶应用、性能调优_image.png)

--create the table and load data
CREATE TABLE employee(
       id int,
       name string,
       deg string,
       salary int,
       dept string
) row format delimited
    fields terminated by ',';

load data local inpath '/root/hivedata/employee.txt' into table employee;

select * from employee;

----ordinary aggregation: sum + group by------------
select dept,sum(salary) as total from employee group by dept;

select id,dept,sum(salary) as total from employee group by dept; --adding id to the output: invalid SQL, id is neither grouped nor aggregated

+-------+---------+
| dept  |  total  |
+-------+---------+
| AC    | 60000   |
| TP    | 120000  |
+-------+---------+

----aggregation with sum as a window function------------
select id,name,deg,salary,dept,sum(salary) over(partition by dept) as total from employee;

+-------+-----------+----------+---------+-------+---------+
|  id   |   name    |   deg    | salary  | dept  |  total  |
+-------+-----------+----------+---------+-------+---------+
| 1204  | prasanth  | dev      | 30000   | AC    | 60000   |
| 1203  | khalil    | dev      | 30000   | AC    | 60000   |
| 1206  | kranthi   | admin    | 20000   | TP    | 120000  |
| 1202  | manisha   | cto      | 50000   | TP    | 120000  |
| 1201  | gopal     | manager  | 50000   | TP    | 120000  |
+-------+-----------+----------+---------+-------+---------+
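The contrast between the two queries can be reproduced with sqlite3, whose window functions (SQLite 3.25+) follow the same sum(...) over(partition by ...) semantics; the data is copied from the table above:

```python
# Window aggregation keeps every detail row and appends the per-dept total,
# unlike group by, which collapses each dept to one row.
# sqlite3 sketch; requires SQLite 3.25+ for window functions.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table employee(id int, name text, deg text, salary int, dept text)")
conn.executemany("insert into employee values (?,?,?,?,?)", [
    (1204, "prasanth", "dev", 30000, "AC"),
    (1203, "khalil", "dev", 30000, "AC"),
    (1206, "kranthi", "admin", 20000, "TP"),
    (1202, "manisha", "cto", 50000, "TP"),
    (1201, "gopal", "manager", 50000, "TP"),
])
rows = conn.execute("""
    select id, dept, sum(salary) over(partition by dept) as total
    from employee order by id
""").fetchall()
print(rows)
# every AC row carries total 60000, every TP row 120000; all 5 rows survive
```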

2. Window function syntax

A function with an OVER clause is a window function.
Function OVER ([PARTITION BY <...>] [ORDER BY <....>] [<window_expression>])

--1. Function can be any of the following categories
	--aggregate functions: sum, avg, max, min, etc.
    --ranking functions: rank, row_number, etc.
    --analytical functions: lead, lag, first_value, etc.

--2. OVER is the keyword that marks a function as a window function

--3. PARTITION BY <...> works like group by: rows with equal values form one partition. Without PARTITION BY, the whole table is a single partition;

--4. ORDER BY <....> specifies how rows are sorted within each partition; ascending (ASC) by default, both ASC and DESC are supported;

--5. window_expression, also called the window clause, restricts which rows within the partition the function operates on
	⭐⭐⭐
	rows between ... and ... delimits the frame by physical row offsets, e.g. rows between 3 preceding and current row covers the current row plus the 3 rows before it
	range between ... and ... delimits the frame by the values of the ORDER BY column instead of row positions
---create the tables and load data
create table website_pv_info(
   cookieid string,
   createtime string,   --day
   pv int
) row format delimited
fields terminated by ',';

create table website_url_info (
    cookieid string,
    createtime string,  --visit time
    url string       --visited page
) row format delimited
fields terminated by ',';


load data local inpath '/root/hivedata/website_pv_info.txt' into table website_pv_info;
load data local inpath '/root/hivedata/website_url_info.txt' into table website_url_info;

select * from website_pv_info;
select * from website_url_info;

Knowledge point 07: Apache Hive--window functions--aggregate functions

⭐⭐⭐ Hive SQL aggregate functions with rows between / range between, explained in detail

Knowledge point 08: Apache Hive--window functions--the window clause

Literally the "window expression"; colloquially called the window clause.
  unbounded  no boundary
  preceding  before the current row
  following  after the current row
  unbounded preceding  all preceding rows, i.e. from the first row of the partition
  n preceding  n rows before
  unbounded following  all following rows, i.e. to the last row of the partition
  n following  n rows after
  current row  the current row

  Syntax
  (ROWS | RANGE) BETWEEN (UNBOUNDED | [num]) PRECEDING AND ([num] PRECEDING | CURRENT ROW | (UNBOUNDED | [num]) FOLLOWING)

  (ROWS | RANGE) BETWEEN CURRENT ROW AND (CURRENT ROW | (UNBOUNDED | [num]) FOLLOWING)
  (ROWS | RANGE) BETWEEN [num] FOLLOWING AND (UNBOUNDED | [num]) FOLLOWING
  
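The frame syntax above can be exercised with sqlite3 (SQLite 3.25+), summing the same column under two different ROWS frames; the pv values below are made up:

```python
# Two frames over the same partition: a running total (unbounded preceding)
# and a sliding 2-row window (1 preceding and current row).
# sqlite3 sketch; requires SQLite 3.25+ for window functions.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table website_pv_info(cookieid text, createtime text, pv int)")
conn.executemany("insert into website_pv_info values (?,?,?)", [
    ("cookie1", "2018-04-10", 1),
    ("cookie1", "2018-04-11", 5),
    ("cookie1", "2018-04-12", 7),
    ("cookie1", "2018-04-13", 3),
])
rows = conn.execute("""
    select createtime, pv,
           sum(pv) over(partition by cookieid order by createtime
                        rows between unbounded preceding and current row) as cum_pv,
           sum(pv) over(partition by cookieid order by createtime
                        rows between 1 preceding and current row) as last2_pv
    from website_pv_info order by createtime
""").fetchall()
# cum_pv grows 1, 6, 13, 16; last2_pv is 1, 6, 12, 10
```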

Knowledge point 09: Apache Hive--window functions--ranking functions (row_number and friends)

The second category of window functions.


![8-Apache Hive函数高阶应用、性能调优_image-2.png](/img/user/czc%E7%9F%A5%E8%AF%86%E5%BA%93/%E6%9D%82%E4%B8%83%E6%9D%82%E5%85%AB/9-%E9%99%84%E4%BB%B6/%E9%99%84%E4%BB%B6/8-Apache%20Hive%E5%87%BD%E6%95%B0%E9%AB%98%E9%98%B6%E5%BA%94%E7%94%A8%E3%80%81%E6%80%A7%E8%83%BD%E8%B0%83%E4%BC%98_image-2.png)

Explanation:

![8-Apache Hive函数高阶应用、性能调优_image-3.png](/img/user/czc%E7%9F%A5%E8%AF%86%E5%BA%93/%E6%9D%82%E4%B8%83%E6%9D%82%E5%85%AB/9-%E9%99%84%E4%BB%B6/%E9%99%84%E4%BB%B6/8-Apache%20Hive%E5%87%BD%E6%95%B0%E9%AB%98%E9%98%B6%E5%BA%94%E7%94%A8%E3%80%81%E6%80%A7%E8%83%BD%E8%B0%83%E4%BC%98_image-3.png)

Another example:

![8-Apache Hive函数高阶应用、性能调优_image-4.png](/img/user/czc%E7%9F%A5%E8%AF%86%E5%BA%93/%E6%9D%82%E4%B8%83%E6%9D%82%E5%85%AB/9-%E9%99%84%E4%BB%B6/%E9%99%84%E4%BB%B6/8-Apache%20Hive%E5%87%BD%E6%95%B0%E9%AB%98%E9%98%B6%E5%BA%94%E7%94%A8%E3%80%81%E6%80%A7%E8%83%BD%E8%B0%83%E4%BC%98_image-4.png)

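The three ranking functions illustrated in the figures differ only in how they treat ties. A sqlite3 sketch (SQLite 3.25+) with made-up pv values that contain a tie:

```python
# rank(): ties share a rank, then a gap follows.
# dense_rank(): ties share a rank, no gap.
# row_number(): strictly sequential, ties broken arbitrarily.
# sqlite3 sketch; requires SQLite 3.25+ for window functions.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table t(cookieid text, pv int)")
conn.executemany("insert into t values (?,?)",
                 [("c1", 7), ("c1", 5), ("c1", 5), ("c1", 3)])
rows = conn.execute("""
    select pv,
           rank()       over(order by pv desc) as rnk,   -- 1,2,2,4
           dense_rank() over(order by pv desc) as drnk,  -- 1,2,2,3
           row_number() over(order by pv desc) as rn     -- unique 1..4
    from t order by pv desc
""").fetchall()
print(rows)
```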

```sql
--Requirement: find each user's Top 3 visits by pv; ties are not treated specially
SELECT * from
(SELECT
    cookieid,
    createtime,
    pv,
    ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY pv DESC) AS seq
FROM website_pv_info) tmp where tmp.seq <4;
```

Knowledge point 10: Apache Hive--window functions--the lag and lead functions

Analytical window functions

(image: 8-Apache Hive函数高阶应用、性能调优_image-5.png)

```sql
--LAG returns the value from the n-th row BEFORE the current row within the window
SELECT cookieid,
       createtime,
       url,
       ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn,
       LAG(createtime,1,'1970-01-01 00:00:00') OVER(PARTITION BY cookieid ORDER BY createtime) AS last_1_time,
       LAG(createtime,2) OVER(PARTITION BY cookieid ORDER BY createtime) AS last_2_time
FROM website_url_info;


--LEAD returns the value from the n-th row AFTER the current row within the window
SELECT cookieid,
       createtime,
       url,
       ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn,
       LEAD(createtime,1,'1970-01-01 00:00:00') OVER(PARTITION BY cookieid ORDER BY createtime) AS next_1_time,
       LEAD(createtime,2) OVER(PARTITION BY cookieid ORDER BY createtime) AS next_2_time
FROM website_url_info;

--FIRST_VALUE returns, after sorting within the partition, the first value up to the current row
SELECT cookieid,
       createtime,
       url,
       ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn,
       FIRST_VALUE(url) OVER(PARTITION BY cookieid ORDER BY createtime) AS first1
FROM website_url_info;

--LAST_VALUE returns, after sorting within the partition, the last value up to the current row
SELECT cookieid,
       createtime,
       url,
       ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn,
       LAST_VALUE(url) OVER(PARTITION BY cookieid ORDER BY createtime) AS last1
FROM website_url_info;
```

⭐ czc's window-function summary

Knowledge point 11: Apache Hive--file storage formats (text, ORC, parquet)

(image: Pasted image 20250410161300.png)

⭐ After loading, the text file is about 18 MB

```sql
--2. Create a table that stores its data in ORC format
create table log_orc(
track_time string,
url string,
session_id string,
referer string,
ip string,
end_user_id string,
city_id string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS orc;

--Insert data. Why can't the load command be used here? Because load is a pure copy/move operation; it does not convert the file format.
insert into table log_orc select * from log_text;
```

⭐ After insertion, the ORC file is about 2.78 MB

```sql
--3. Create a table that stores its data in parquet format
create table log_parquet(
track_time string,
url string,
session_id string,
referer string,
ip string,
end_user_id string,
city_id string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS PARQUET;

--Insert data
insert into table log_parquet select * from log_text;
```

⭐ After insertion, the parquet file is about 12.9 MB


Knowledge point 12: Apache Hive--pairing compression with file formats (ORC+snappy)


Knowledge point 13: Apache Hive--general tuning--fetch mechanism, MR local mode

Fetch mechanism

In the following 3 cases the SQL does not launch an MR job (controlled by the hive.fetch.task.conversion property):

--full-table query
select * from student;
--projection of specific columns
select num,name from student;
--limit query
select num,name from student limit 2;

MapReduce local mode

Local mode does not fix the root cause: running Hive on MR is simply slow. It is better to switch the execution engine, e.g. to Spark or Tez.


知识点14:Apache Hive--通用调优--join优化

底层还是MapReduce的 join 优化

下面的总结:

join的优化最终都是往Map端的join靠近,避免走reduce的join
实在不行只能用reduce的join慢慢跑,至少能跑出来

MapJoin
ReduceJoin
BucketJoin

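The map-side join idea can be sketched in plain Python (an illustration of the concept, not Hive internals): hash the small table once, then stream the big table against it, so no shuffle or reduce phase is needed:

```python
# Map-side (broadcast hash) join sketch: build a hash table from the small
# relation, probe it while streaming the big relation once.

def map_join(big_rows, small_rows, key):
    # Build phase: hash the small table (Hive broadcasts this to every map task).
    lookup = {}
    for r in small_rows:
        lookup.setdefault(r[key], []).append(r)
    # Probe phase: each big-table row joins locally, no shuffle of the big side.
    out = []
    for r in big_rows:
        for s in lookup.get(r[key], []):
            out.append({**r, **s})
    return out

orders = [{"uid": 1, "amt": 10}, {"uid": 2, "amt": 20}, {"uid": 1, "amt": 5}]
users = [{"uid": 1, "name": "a"}, {"uid": 2, "name": "b"}]
joined = map_join(orders, users, "uid")
# 3 joined rows; the big side was scanned exactly once
```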
Knowledge point 15: Apache Hive--general tuning--data-skew optimization

Data skew: an uneven distribution of data across tasks.

(image: 8-Apache Hive函数高阶应用、性能调优_image-7.png)

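A standard remedy for a skewed group-by key, similar in spirit to what Hive's hive.groupby.skewindata setting automates, is two-stage aggregation with key salting. A pure-Python sketch with made-up data:

```python
# Stage 1 salts the hot key with a random suffix so its rows spread over
# several reducers; stage 2 strips the salt and merges the partial sums.
import random
from collections import defaultdict

random.seed(0)
rows = [("hot_key", 1)] * 1000 + [("cold_key", 1)] * 10

# Stage 1: partial aggregation on the salted key.
stage1 = defaultdict(int)
for key, value in rows:
    salted = f"{key}#{random.randrange(4)}"   # 4-way salt
    stage1[salted] += value

# Stage 2: strip the salt and combine the partial results.
stage2 = defaultdict(int)
for salted, partial in stage1.items():
    stage2[salted.split("#")[0]] += partial

print(stage2["hot_key"], stage2["cold_key"])  # 1000 10
```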

Knowledge point 16: Apache Hive--general tuning--adjusting the number of MR tasks


Knowledge point 17: Apache Hive--general tuning--execution plans


Knowledge point 18: Apache Hive--general tuning--parallel execution and speculative execution